Take-home Exercise 2

Published

March 3, 2024

1. Introduction

1.1 Context

Dengue Hemorrhagic Fever (in short dengue fever) is one of the most widespread mosquito-borne diseases in the most tropical and subtropical regions. It is an acute disease caused by dengue virus infection which is transmitted by female Aedes aegypti and Aedes albopictus mosquitoes. In 2015, Taiwan had recorded the most severe dengue fever outbreak with more than 43,000 dengue cases and 228 deaths. Since then, the annual reported dengue fever cases were maintained at the level of not more than 200 cases. However, in 2023, Taiwan recorded 26703 dengue fever cases. More than 25,000 cases were reported at Tainan City, and more than 80% of the reported dengue fever cases occurred in the month August-November 2023 and epidemiology week 31-50.

1.2 Objectives

We will be exploring the following:

  • if the distribution of dengue fever outbreak at Tainan City, Taiwan are independent from space and space and time.

  • If the outbreak is indeed spatial and spatio-temporal dependent, and if so discover where are the clusters and outliers, and the emerging hot spot/cold spot areas.

2. Getting Started

2.1 Loading R packages

The R packages that we will be using in this exercise are as follows:

  • sf: for importing and handling geospatial data in R

  • tidyverse: a collection of packages for data science tasks

  • sfdep: Used to compute spatial weights, global and local spatial autocorrelation statistics, and emerging hotspot analysis (EHSA)

  • spdep: Similar to sfdep.

  • tmap: Provides functions for plotting cartographic quality static point patterns maps or interactive maps by using leaflet API.

  • lubridate: For manipulating datetime variables.

  • ggplot2: Allows for simple visualisations like bar charts and line plots.

  • Kendall: Provides the tool to perform Mann-Kendall test

Show code
pacman::p_load(sf, tidyverse, sfdep, spdep, tmap, lubridate, ggplot2, Kendall)

2.2 Datasets

Name Description File Type
TAINAN_VILLAGE Geospatial data of village boundary of Taiwan ESRI Shapefile
Dengue_Daily.csv Aspatial data of reported dengue cases in Taiwan since 1998. CSV Format

2.3 Importing the datasets

The datasets that we will be using are as follow:

We will read the shape file in as a sf data frame using st_read(), transform it using st_transform() to the right CRS which is 3829, then confine the scope to counties specified in the assignment brief using filter() on ‘TOWNID’.

Show code
tainan <- st_read(dsn = "data/geospatial",
                 layer = "TAINAN_VILLAGE") %>% 
  st_transform(crs = 3829) %>%
  filter(TOWNID %in% c("D01","D02","D04","D06","D07","D08","D32","D39"))
Reading layer `TAINAN_VILLAGE' from data source 
  `/Users/jacksontan/Documents/Sashimii0219/IS415-GAA/Take-home_Ex/Take-home_Ex02/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 649 features and 10 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 120.0269 ymin: 22.88751 xmax: 120.6563 ymax: 23.41374
Geodetic CRS:  TWD97
Show code
dengue <- read_csv("data/aspatial/Dengue_Daily.csv")

Importing Tainan Polygons

Show code
tainan <- st_read(dsn = "data/geospatial",
                 layer = "TAINAN_VILLAGE") %>% 
  st_transform(crs = 3829) %>%
  filter(TOWNID %in% c("D01","D02","D04","D06","D07","D08","D32","D39"))
Reading layer `TAINAN_VILLAGE' from data source 
  `/Users/jacksontan/Documents/Sashimii0219/IS415-GAA/Take-home_Ex/Take-home_Ex02/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 649 features and 10 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 120.0269 ymin: 22.88751 xmax: 120.6563 ymax: 23.41374
Geodetic CRS:  TWD97

3. Data Wrangling

3.1 Data Pre-processing - Taiwan Village 2020 Dataset

Show code
glimpse(tainan)
Rows: 258
Columns: 11
$ VILLCODE   <chr> "67000350032", "67000270011", "67000370005", "67000330004",…
$ COUNTYNAME <chr> "臺南市", "臺南市", "臺南市", "臺南市", "臺南市", "臺南市",…
$ TOWNNAME   <chr> "安南區", "仁德區", "中西區", "南區", "安南區", "安南區", "…
$ VILLNAME   <chr> "青草里", "保安里", "赤嵌里", "大成里", "城北里", "城南里",…
$ VILLENG    <chr> "Qingcao Vil.", "Bao'an Vil.", "Chihkan Vil.", "Dacheng Vil…
$ COUNTYID   <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",…
$ COUNTYCODE <chr> "67000", "67000", "67000", "67000", "67000", "67000", "6700…
$ TOWNID     <chr> "D06", "D32", "D08", "D02", "D06", "D06", "D08", "D06", "D0…
$ TOWNCODE   <chr> "67000350", "67000270", "67000370", "67000330", "67000350",…
$ NOTE       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geometry   <POLYGON [m]> POLYGON ((203866.5 2555960,..., POLYGON ((215123.1 …

First look at the dataset reveals that there are several columns that are either in Traditional Chinese or contain IDs that we do not need. In that case, let’s drop those columns and the last 4 letters (spacebar + Vil.) of VILLENG column:

Show code
tainan_clean <- tainan %>%
  mutate(village = substring(VILLENG, 1, nchar(VILLENG)-5)) %>%
  select(1,4,12,11)

In the case of Chinese Pinyin, different variations of similar sounding Chinese words can end up the same when translated to English.

Show code
tainan_clean[duplicated(tainan_clean$village),]
Simple feature collection with 9 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 211783.7 ymin: 2538314 xmax: 216840.6 ymax: 2554862
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
       VILLCODE VILLNAME   village                       geometry
111 67000350046   溪東里    Xidong POLYGON ((212090.2 2549717,...
112 67000350038   溪北里     Xibei POLYGON ((213637.7 2549689,...
128 67000270012   成功里 Chenggong POLYGON ((214629 2542198, 2...
137 67000320006   仁和里     Renhe POLYGON ((215772.9 2543004,...
146 67000310019   復興里    Fuxing POLYGON ((216584.4 2546984,...
214 67000340008   仁愛里    Ren'ai POLYGON ((214700 2547025, 2...
245 67000310024   成功里 Chenggong POLYGON ((216133 2546922, 2...
246 67000310025   中興里 Zhongxing POLYGON ((216431.3 2547398,...
253 67000350003   塭南里    Wennan POLYGON ((214620.8 2554862,...

We can see that there are 9 such instances. Let’s take a closer look at the first entry.

Show code
tainan_clean[tainan_clean$village == "Xidong",]
Simple feature collection with 2 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 209825.5 ymin: 2539524 xmax: 212761.9 ymax: 2549719
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
       VILLCODE VILLNAME village                       geometry
82  67000330014   喜東里  Xidong POLYGON ((212211 2540508, 2...
111 67000350046   溪東里  Xidong POLYGON ((212090.2 2549717,...

The first one is called Xi3 Dong, while the second one is Xi1 Dong. We will manually change them by adding their village ID in brackets to one of each duplicates.

Show code
duplicates <- c(111,112,128,137,146,214,245,246,253)
tainan_clean[duplicates,"village"] <- paste0(tainan_clean[duplicates,]$village, 
                                             "(", tainan_clean[duplicates,]$VILLCODE, ")")
Show code
tainan_clean[duplicated(tainan_clean$village),]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
[1] VILLCODE VILLNAME village  geometry
<0 rows> (or 0-length row.names)
Show code
tainan_clean
Simple feature collection with 258 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 198147.4 ymin: 2534651 xmax: 221655.9 ymax: 2556729
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
First 10 features:
      VILLCODE VILLNAME  village                       geometry
1  67000350032   青草里  Qingcao POLYGON ((203866.5 2555960,...
2  67000270011   保安里   Bao'an POLYGON ((215123.1 2539291,...
3  67000370005   赤嵌里  Chihkan POLYGON ((212263.9 2546464,...
4  67000330004   大成里  Dacheng POLYGON ((211948.4 2544453,...
5  67000350028   城北里 Chengbei POLYGON ((205015.6 2553859,...
6  67000350030   城南里 Chengnan POLYGON ((204553 2554303, 2...
7  67000370009   法華里    Fahua POLYGON ((213068.1 2544770,...
8  67000350017   海南里   Hainan POLYGON ((209616.5 2549009,...
9  67000350049   國安里   Guo'an POLYGON ((210817.8 2549594,...
10 67000350018   溪心里    Xixin POLYGON ((210566.2 2553279,...

No more duplicates!. Next we will check for invalid geometries and missing values.

Show code
test <- st_is_valid(tainan_clean,reason=TRUE)

# Number of invalid geometries
length(which(test!= "Valid Geometry"))
[1] 0
Show code
# Reason
test[which(test!= "Valid Geometry")]
character(0)
Show code
tainan_clean[rowSums(is.na(tainan_clean))!=0,]
Simple feature collection with 0 features and 3 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
[1] VILLCODE VILLNAME village  geometry
<0 rows> (or 0-length row.names)

There are no invalid geometries or missing values, great! Now let’s do a simple visualization to check if we extracted the right data.

Show code
plot(st_geometry(tainan_clean))

3.2 Data Pre-processing - Taiwan Dengue Cases

Show code
dengue
# A tibble: 106,861 × 26
   發病日     個案研判日 通報日     性別  年齡層 居住縣市 居住鄉鎮 居住村里
   <date>     <chr>      <date>     <chr> <chr>  <chr>    <chr>    <chr>   
 1 1998-01-02 None       1998-01-07 男    40-44  屏東縣   屏東市   None    
 2 1998-01-03 None       1998-01-14 男    30-34  屏東縣   東港鎮   None    
 3 1998-01-13 None       1998-02-18 男    55-59  宜蘭縣   宜蘭市   None    
 4 1998-01-15 None       1998-01-23 男    35-39  高雄市   苓雅區   None    
 5 1998-01-20 None       1998-02-04 男    55-59  宜蘭縣   五結鄉   None    
 6 1998-01-22 None       1998-02-19 男    20-24  桃園市   蘆竹區   None    
 7 1998-01-23 None       1998-02-02 男    40-44  新北市   新店區   None    
 8 1998-01-26 None       1998-02-19 女    65-69  台北市   北投區   None    
 9 1998-02-11 None       1998-02-13 女    25-29  台南市   南區     None    
10 1998-02-16 None       1998-02-24 男    20-24  高雄市   楠梓區   None    
# ℹ 106,851 more rows
# ℹ 18 more variables: 最小統計區 <chr>, 最小統計區中心點X <chr>,
#   最小統計區中心點Y <chr>, 一級統計區 <chr>, 二級統計區 <chr>,
#   感染縣市 <chr>, 感染鄉鎮 <chr>, 感染村里 <chr>, 是否境外移入 <chr>,
#   感染國家 <chr>, 確定病例數 <dbl>, 居住村里代碼 <chr>, 感染村里代碼 <chr>,
#   血清型 <chr>, 內政部居住縣市代碼 <chr>, 內政部居住鄉鎮代碼 <chr>,
#   內政部感染縣市代碼 <chr>, 內政部感染鄉鎮代碼 <chr>

Similar to the previous dataset, a significant portion of the data are in Traditional Chinese. Fortunately, the only columns we need are 發病日, 最小統計區中心點X, and 最小統計區中心點Y, which translates to Onset Date, X coordinates and Y coordinates. Let’s drop the columns that we don’t need and translate the column names to English.

Show code
dengue_clean <- dengue %>%
     select(1,10,11) %>%
  rename(onset_date=發病日,
         X=最小統計區中心點X,
         Y=最小統計區中心點Y) %>%
  filter(!(Y == "None"))

From the date variable, we will also need to extract the epidemiology week and year from the onset_date variable for analysis, then confine it to epidemiology week 31-50, 2023.

Show code
dengue_epiweek <- dengue_clean %>%
  mutate(year = factor(year(onset_date)),
         epiweek = factor(epiweek(onset_date))) %>%
  filter(year == 2023 & 
        (epiweek %in% (31:50)))

Resetting the levels in the factors.

Show code
dengue_epiweek$epiweek <- factor(dengue_epiweek$epiweek)
dengue_epiweek$year <- factor(dengue_epiweek$year)

In order to do spatial analysis, we will need to convert the data to sf dataframe using st_as_sf().

Show code
dengue_sf <- st_as_sf(dengue_epiweek, 
                       coords = c("X", "Y"),
                       crs=4326) %>%
            st_transform(crs = 3829)

Next, in order to analyse these data by their respective village, we will use st_intersection on dengue_sf and tainan_clean to add their respective village to each data point. We will also drop both the onset_date and year variable as we will not need them anymore.

Show code
st_intersection(dengue_sf, tainan_clean)
Simple feature collection with 18816 features and 6 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 201081.8 ymin: 2535316 xmax: 220420.6 ymax: 2555629
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
# A tibble: 18,816 × 7
   onset_date year  epiweek VILLCODE  VILLNAME village           geometry
 * <date>     <fct> <fct>   <chr>     <chr>    <chr>          <POINT [m]>
 1 2023-09-14 2023  37      67000350… 青草里   Qingcao (203370.3 2555156)
 2 2023-10-12 2023  41      67000350… 青草里   Qingcao (203370.3 2555156)
 3 2023-08-04 2023  31      67000270… 保安里   Bao'an  (216211.6 2537686)
 4 2023-09-10 2023  37      67000270… 保安里   Bao'an  (217062.4 2537902)
 5 2023-09-11 2023  37      67000270… 保安里   Bao'an  (215275.6 2537869)
 6 2023-09-16 2023  37      67000270… 保安里   Bao'an  (215831.1 2537495)
 7 2023-09-19 2023  38      67000270… 保安里   Bao'an    (215377 2538568)
 8 2023-09-20 2023  38      67000270… 保安里   Bao'an  (215604.8 2538239)
 9 2023-09-27 2023  39      67000270… 保安里   Bao'an  (215831.1 2537495)
10 2023-09-29 2023  39      67000270… 保安里   Bao'an  (216211.6 2537686)
# ℹ 18,806 more rows
Show code
dengue_combined <- st_intersection(dengue_sf, tainan_clean) %>%
  select(6,3,7)

Again, we will check for invalid geometries and missing values.

Show code
test <- st_is_valid(dengue_combined,reason=TRUE)

# Number of invalid geometries
length(which(test!= "Valid Geometry"))
[1] 0
Show code
# Reason
test[which(test!= "Valid Geometry")]
character(0)
Show code
dengue_combined[rowSums(is.na(dengue_combined))!=0,]
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
# A tibble: 0 × 3
# ℹ 3 variables: village <chr>, epiweek <fct>, geometry <GEOMETRY [m]>

We will count the total number of cases for each village using the following, then appending it back to the village polygon geometry.

Show code
dengue_counts <- dengue_combined %>% 
                  count(village) %>%
                  rename(total_cases = n)
Show code
tainan_clean_cases <- tainan_clean %>%
  st_join(dengue_counts) %>%
  select(4:6) %>%
  rename(village = village.y) %>%
  drop_na()

We will now do the same but with village and epiweek for each group, in order for us to create a spacetime cube.

Show code
tainan_clean_epw <- tainan_clean %>% mutate(epiweek = factor(31,levels = c(31:50)))
village_epiweek <- tainan_clean_epw %>% 
  expand(village,epiweek) 

village_epiweek
# A tibble: 5,160 × 2
   village epiweek
   <chr>   <fct>  
 1 Andong  31     
 2 Andong  32     
 3 Andong  33     
 4 Andong  34     
 5 Andong  35     
 6 Andong  36     
 7 Andong  37     
 8 Andong  38     
 9 Andong  39     
10 Andong  40     
# ℹ 5,150 more rows
Show code
dengue_counts_epiweek <- dengue_combined %>%
  group_by(village,epiweek) %>%
  count() %>%
  rename(total_cases = n)
Show code
village_epiweek_count <- village_epiweek %>% 
  left_join(dengue_counts_epiweek, by=c("village","epiweek")) %>%
  mutate(across(everything(), replace_na, 0)) %>%
  select(1:3)

3.3 Creating Spacetime Cube

We will be creating our spacetime cube for subsequent emerging hot spot analysis.

Show code
dengue_spacetime <- spacetime(village_epiweek_count, tainan_clean,
          .loc_col = "village",
          .time_col = "epiweek")

Ensuring that our spacetime object is TRUE.

Show code
is_spacetime_cube(dengue_spacetime)
[1] TRUE

3.4 Plotting Spatial Data

Before we continue, we will first plot the data points together with their respective village to have a quick overview of our spatial data on hand.

Show code
tmap_mode('view')
tm_shape(tainan_clean) +
  tm_polygons('village') +
tm_shape(dengue_combined) +
  tm_dots(alpha=0.5)

3.5 Exploratory Data Analysis

In order to do geospatial analysis, we first need to have a brief idea of our dataset. In this case, we have some questions related to dengue that we can ask. Before that, let us insert this plot of dengue cases across epiweek for reference.

Show code
ggplot(dengue_combined, aes(x=epiweek)) +
  geom_bar()

3.5.1 Where did it started?

So where did the outbreak got out of control? To answer this question, let’s first look at the spread of dengue in week 31.

Show code
dengue_week31 <- dengue_combined %>%
  filter(epiweek == 31)
Show code
ggplot(dengue_week31, aes(x=village)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Number of dengue cases in Epiweek 31 by Village",
       x = "Village Name",
       y = "Number of dengue cases")

At a glance, we can see that there are several outliers that are already has an extremely high amount of dengue cases. Let’s sort the dataset by number of cases and take a look at them again.

Show code
dengue_week31_top10 <- dengue_week31 %>%
  group_by(village) %>%
  summarise(total_count=n()) %>%
  top_n(10, total_count) %>%
  .$village

ggplot(dengue_week31[dengue_week31$village %in% dengue_week31_top10,], 
       aes(x=village)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Trips Destination Distribution by Planning Area",
       x = "Planning Area",
       y = "Number of Trips")

We can see that there is a clear outlier in the graph, the village “Wuwang”. Moving forward, let’s keep that in mind.

3.5.2 Area with the most dengue cases

Show code
dengue_top10 <- dengue_combined %>%
  group_by(village) %>%
  summarise(total_count=n()) %>%
  top_n(10, total_count) %>%
  .$village

ggplot(dengue_combined[dengue_combined$village %in% dengue_top10,], 
       aes(x=village)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Top 10 villages with highest number of overall dengue",
       x = "Village",
       y = "Number of dengue cases")

4. Global Spatial Autocorrelation Analysis

In this section, we aim to assess the spatial patterns of dengue cases in Tainan by computing global spatial autocorrelation statistics and conducting a spatial complete randomness test for a comprehensive understanding of the disease distribution.

4.1 Computing Contiguity Spatial Weights

Before we can compute the global spatial autocorrelation statistics for dengue cases in Tainan, we will need to construct a spatial weights of the study area, which is used to define the neighbourhood relationships between the geographical units, in this case the villages. In the following code, there are 2 parts to it - st_contiguity() computes the contiguity weight matrices, while st_weights() assign weights to each neighboring polygon.

Show code
tainan_clean_cases
Simple feature collection with 257 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 198147.4 ymin: 2534651 xmax: 221655.9 ymax: 2556729
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
First 10 features:
    village total_cases                       geometry
1   Qingcao           2 POLYGON ((203866.5 2555960,...
2    Bao'an          19 POLYGON ((215123.1 2539291,...
3   Chihkan         111 POLYGON ((212263.9 2546464,...
4   Dacheng          29 POLYGON ((211948.4 2544453,...
5  Chengbei           1 POLYGON ((205015.6 2553859,...
6  Chengnan          10 POLYGON ((204553 2554303, 2...
7     Fahua          38 POLYGON ((213068.1 2544770,...
8    Hainan          44 POLYGON ((209616.5 2549009,...
9    Guo'an         112 POLYGON ((210817.8 2549594,...
10    Xixin          65 POLYGON ((210566.2 2553279,...
Show code
wm_q <- tainan_clean_cases %>%
  mutate(nb = st_contiguity(geometry),
         wt = st_weights(nb, style = "W"),
         .before = 1,
         allow_zero = TRUE)

4.2 Moran’s I

The Moran’s I test computes global spatial autocorrelation statistics, and the results of it draws the following conclusion:

  • Moran I > 0 indicates positive spatial autocorrelation (similar values cluster together),

  • Moran I < 0 indicates negative spatial autocorrelation (dissimilar values cluster together),

  • Moran I = 0 indicates no spatial autocorrelation (random distribution).

4.2.1 Moran’s I Test

Show code
global_moran_test(wm_q$total_cases,
                  wm_q$nb,
                  wm_q$wt)

    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 12.709, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.464548385      -0.003906250       0.001358733 
  1. In our case, the Moran I statistic is 0.464548385, which is positive, suggesting the presence of positive spatial autocorrelation.
  2. The p-value is very small (< 2.2e-16), further supporting the rejection of the null hypothesis, which means that there is spatial autocorrelation.

In summary, based on these results, we can conclude that there is a significant positive spatial autocorrelation in the variable total_cases, meaning that similar values tend to be clustered together in space. This make sense when you consider the nature of transmission of dengue.

4.2.2 Monte Carlo Moran’s I

Even though the analytical approach to the Moran’s I analysis benefits from being fast, it may be sensitive to irregularly distributed polygons. A safer approach to hypothesis testing is to run a Monte Carlo simulation.

Show code
set.seed(9999)
bperm<- global_moran_perm(wm_q$total_cases,
                  wm_q$nb,
                  wm_q$wt,
                  nsim = 999)

bperm

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 1000 

statistic = 0.46455, observed rank = 1000, p-value < 2.2e-16
alternative hypothesis: two.sided

The results from the Monte Carlo simulation supports our findings from the previous test.

4.2.3 Computing Moran’s I correlogram

We will be using sp.correlogram() of spdep package to compute a 6-lag spatial correlogram of total_cases using the global spatial autocorrelation used in Moran’s I.

Show code
MI_corr <- sp.correlogram(wm_q$nb, 
                          tainan_clean_cases$total_cases, 
                          order=6, 
                          method="I", 
                          style="W")
plot(MI_corr)

Plotting the output alone might not allow us to provide complete interpretation. This is because not all autocorrelation values are statistically significant. Hence, it is important for us to examine the full analysis report by printing out the analysis results as in the code chunk below.

Show code
print(MI_corr)
Spatial correlogram for tainan_clean_cases$total_cases 
method: Moran's I
           estimate expectation    variance standard deviate Pr(I) two sided
1 (257)  0.46454839 -0.00390625  0.00135873          12.7087       < 2.2e-16
2 (257)  0.21120897 -0.00390625  0.00057767           8.9501       < 2.2e-16
3 (257)  0.02430593 -0.00390625  0.00035991           1.4871       0.1369895
4 (257) -0.05627157 -0.00390625  0.00026671          -3.2064       0.0013439
5 (257) -0.06086713 -0.00390625  0.00021940          -3.8456       0.0001203
6 (257) -0.02228646 -0.00390625  0.00019955          -1.3011       0.1932078
           
1 (257) ***
2 (257) ***
3 (257)    
4 (257) ** 
5 (257) ***
6 (257)    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Findings from the above results suggests significant positive spatial autocorrelation in total_cases across various distance lags (1,2,4, and 5), supporting the conclusion that chances of an area being a high-risk dengue zone is high if their neighbor is one.

4.3 Geary’s C

The Geary’s C test computes global spatial autocorrelation statistics too, and the results of it draws the following conclusion:

  • The Geary C statistic ranges from 0 to 2. A value less than 1 suggests positive spatial autocorrelation, while a value greater than 1 suggests negative spatial autocorrelation.

4.3.1 Geary’s C Test

Show code
global_c_test(wm_q$total_cases,
                  wm_q$nb,
                  wm_q$wt)

    Geary C test under randomisation

data:  x 
weights: listw 

Geary C statistic standard deviate = 11.193, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
      0.498506810       1.000000000       0.002007416 

Based on the result, we can conclude that there is significant positive spatial autocorrelation for dengue cases between the villages. The low p-value and the observed Geary C statistic being less than 1 provide evidence against the null hypothesis of no spatial autocorrelation, supporting the presence of a positive spatial pattern in terms of total number of dengue cases across villages.

4.3.2 Monte Carlo Geary’s C

We will now be performing the Monte Carlo simulation for Geary’s C test as well.

Show code
set.seed(9999)
bperm<- global_c_perm(wm_q$total_cases,
                  wm_q$nb,
                  wm_q$wt,
                  nsim = 999)
bperm

    Monte-Carlo simulation of Geary C

data:  x 
weights: listw 
number of simulations + 1: 1000 

statistic = 0.49851, observed rank = 1, p-value = 0.001
alternative hypothesis: greater

Though the p-value is slightly higher, the results of the Monte Carlo simulation still supports the findings from above.

4.3.3 Computing Geary’s C Spatial Correlogram

Show code
GC_corr <- sp.correlogram(wm_q$nb, 
                          tainan_clean_cases$total_cases, 
                          order=6, 
                          method="C", 
                          style="W")
plot(GC_corr)

Again, to get a clearer picture, we will print out the analysis report.

Show code
print(GC_corr)
Spatial correlogram for tainan_clean_cases$total_cases 
method: Geary's C
          estimate expectation   variance standard deviate Pr(I) two sided    
1 (257) 0.49850681  1.00000000 0.00200742         -11.1930       < 2.2e-16 ***
2 (257) 0.75952848  1.00000000 0.00102296          -7.5186       5.538e-14 ***
3 (257) 0.96349581  1.00000000 0.00081933          -1.2753         0.20220    
4 (257) 1.04463640  1.00000000 0.00072161           1.6616         0.09659 .  
5 (257) 1.03292931  1.00000000 0.00057708           1.3708         0.17045    
6 (257) 0.98392669  1.00000000 0.00051532          -0.7081         0.47891    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In conclusion, the interpretation is that there is significant positive spatial autocorrelation in the number of dengue cases between the villages below 2 distance lags. This means that area with high number of dengue cases are spatially clustered, but only below 2 distance lag intervals.

5. Local Spatial Autocorrelation Analysis

5.1 Cluster and Outlier Analysis

Local Indicators of Spatial Association or LISA are statistics that evaluate the existence of clusters in the spatial arrangement of a given variable. In this section, we will apply appropriate Local Indicators for Spatial Association (LISA), especially local Moran’I to detect cluster and/or outlier from dengue cases in Tainan.

5.1.1 Computing local Moran’s I at the village level

In the code below, we will be computing the Local Moran I’s of total number of dengue cases at the village level by using local_moran() of sfdep package.

Show code
set.seed(9999)
lisa <- wm_q %>%
  mutate(local_moran = local_moran(
    total_cases, nb, wt, nsim = 99),
    .before = 1) %>%
  unnest(local_moran)

5.1.2 Mapping the local Moran’s I

To better understand the results of the computed values, let’s map them out.

Show code
tmap_mode('plot')
map1 <- tm_shape(lisa) +
  tm_fill("ii",
          style = "pretty",
          palette = "RdBu") + 
  tm_borders(alpha = 0.5) +
  tm_layout(main.title = "local Moran's I of total_cases",
            main.title.size = 0.8)
map1

By mapping out the local Moran’s I, we can see that area with the highest values are Xiqi village and Anfu village.

The choropleth map shows that there is evidence for both positive and negative Ii values. However, let us plot the map for local Moran’s I p-values as well.

5.1.3 Mapping local Moran’s I p-values

Show code
tmap_mode('plot')
map2 <- tm_shape(lisa) +
  tm_fill("p_ii_sim",
          style = "pretty",
          palette = "GnBu") + 
  tm_borders(alpha = 0.5) +
  tm_layout(main.title = "p-value of local Moran's I",
            main.title.size = 0.8)

map2

For a holistic view of the results, let us plot the both map side by side.

Show code
tmap_arrange(map1, map2, ncol=2)

Results from this comparison also shows that the findings regarding Xiqi village and Anfu village is statistically significant.

5.2. Creating a LISA Cluster Map

LISA map is a categorical map showing outliers and clusters. There are two types of outliers namely: High-Low and Low-High outliers. Likewise, there are two type of clusters namely: High-High and Low-Low cluaters. In fact, LISA map is an interpreted map by combining local Moran’s I of geographical areas and their respective p-values. First, let’s filter out parts of the LISA map that is statistically significant at 5%.

Show code
lisa_sig <- lisa %>%
  filter(p_ii < 0.05)

In short, here are the interpretation of each classification:

  • High-Low - High number of cases with neighbors that have low number of cases.

  • High-High - High number of cases with neighbors that have high number of cases.

  • Low-Low - Low number of cases with neighbors that have low number of cases.

  • Low-High - Low number of cases with neighbors that have high number of cases.

Show code
tmap_mode("plot")
tm_shape(lisa) +
  tm_polygons() +
  tm_borders(alpha = 0.5) +
tm_shape(lisa_sig) +
  tm_fill("mean") +
  tm_borders(alpha = 0.4)

From the above, we can see around 3 clusters of statistically significant high-high zones, with several low-highs in the mix.

5.3 Hot Spot and Cold Spot Area Analysis

Next, we will use spatial weights to identify locations of statistically significant hot spots and cold spots in an spatially weighted attribute that are in proximity to one another based on a calculated distance. The analysis groups features when similar high (hot) or low (cold) values are found in a cluster.

5.3.1 Computing local Gi statistics

The first step is to derive a spatial weight matrix using the steps used previously.

Show code
wm_idw <- tainan_clean_cases %>%
  mutate(nb = st_contiguity(geometry),
         wts = st_inverse_distance(nb, geometry,
                                   scale = 1,
                                   alpha = 1),
         .before = 1)

wm_idw
Simple feature collection with 257 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 198147.4 ymin: 2534651 xmax: 221655.9 ymax: 2556729
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
First 10 features:
                                             nb
1                                   6, 118, 160
2                       126, 128, 138, 167, 221
3          68, 69, 171, 180, 183, 184, 187, 199
4                    94, 97, 100, 104, 181, 206
5                              12, 13, 248, 254
6                      1, 12, 13, 118, 160, 248
7                               54, 98, 99, 200
8  9, 73, 75, 115, 125, 144, 156, 157, 165, 185
9                         8, 110, 115, 125, 165
10                  11, 159, 161, 165, 235, 257
                                                                                                                                          wts
1                                                                                                    0.0003765271, 0.0005151173, 0.0003023114
2                                                                        0.0003316968, 0.0004825468, 0.0003946908, 0.0007680351, 0.0008760837
3                              0.0012455218, 0.0015715215, 0.0019993121, 0.0009593535, 0.0011156322, 0.0013307060, 0.0026978710, 0.0016697290
4                                                                0.001269018, 0.002958948, 0.001816140, 0.002190554, 0.001030674, 0.001028848
5                                                                                      0.0014165165, 0.0011386957, 0.0006321824, 0.0004119013
6                                                          0.0003765271, 0.0007064605, 0.0004467492, 0.0003116941, 0.0003297692, 0.0003869143
7                                                                                          0.002877583, 0.001581528, 0.001416387, 0.002849993
8  0.0004109815, 0.0004262253, 0.0003339108, 0.0005660110, 0.0003345248, 0.0004394663, 0.0007742001, 0.0002962680, 0.0003410048, 0.0009216606
9                                                                        0.0004109815, 0.0012146325, 0.0003653165, 0.0010248951, 0.0007481458
10                                                         0.0009742478, 0.0005550687, 0.0004554816, 0.0004543659, 0.0008942098, 0.0005811834
    village total_cases                       geometry
1   Qingcao           2 POLYGON ((203866.5 2555960,...
2    Bao'an          19 POLYGON ((215123.1 2539291,...
3   Chihkan         111 POLYGON ((212263.9 2546464,...
4   Dacheng          29 POLYGON ((211948.4 2544453,...
5  Chengbei           1 POLYGON ((205015.6 2553859,...
6  Chengnan          10 POLYGON ((204553 2554303, 2...
7     Fahua          38 POLYGON ((213068.1 2544770,...
8    Hainan          44 POLYGON ((209616.5 2549009,...
9    Guo'an         112 POLYGON ((210817.8 2549594,...
10    Xixin          65 POLYGON ((210566.2 2553279,...

Now let us calculate the local Gi statistics.

Show code
HCSA <- wm_idw %>%
  mutate(local_Gi = local_gstar_perm(
    total_cases, nb, wts, nsim = 99),
      .before = 1) %>%
  unnest(local_Gi)
HCSA
Simple feature collection with 257 features and 12 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 198147.4 ymin: 2534651 xmax: 221655.9 ymax: 2556729
Projected CRS: Hu Tzu Shan 1950 / UTM zone 51N
# A tibble: 257 × 13
   gi_star    e_gi   var_gi p_value   p_sim p_folded_sim skewness kurtosis nb   
     <dbl>   <dbl>    <dbl>   <dbl>   <dbl>        <dbl>    <dbl>    <dbl> <nb> 
 1  -2.35  0.00281  1.65e-6 -1.99   0.0467          0.02     0.01    1.25  <int>
 2  -2.07  0.00332  1.36e-6 -1.73   0.0835          0.02     0.01    0.604 <int>
 3   0.365 0.00417  1.18e-6  0.0820 0.935           0.9      0.45    0.481 <int>
 4   0.363 0.00353  1.13e-6  0.739  0.460           0.5      0.25    0.482 <int>
 5  -2.65  0.00302  1.42e-6 -2.34   0.0191          0.02     0.01    0.306 <int>
 6  -3.17  0.00354  1.53e-6 -2.69   0.00714         0.02     0.01    0.976 <int>
 7  -1.25  0.00360  1.63e-6 -1.12   0.263           0.32     0.16    0.304 <int>
 8  -0.280 0.00401  9.45e-7 -0.384  0.701           0.72     0.36    0.436 <int>
 9   1.53  0.00416  1.51e-6  1.34   0.179           0.16     0.08    0.587 <int>
10  -1.47  0.00394  1.03e-6 -1.73   0.0838          0.04     0.02    0.350 <int>
# ℹ 247 more rows
# ℹ 4 more variables: wts <list>, village <chr>, total_cases <int>,
#   geometry <POLYGON [m]>

5.3.2 Visualising Gi values

Show code
tmap_mode("plot")
map3 <- tm_shape(HCSA) +
  tm_fill("gi_star") +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title = "Gi* of dengue cases",
            main.title.size = 0.8) +
  tm_view(set.zoom.limits = c(6,8))

map3

To interpret the map, greater value in either direction represents stronger clustering, with positive values being hot clusters and negative values being cold clusters. In the Gi value plot above, the hot spot area identified in previous section extends to Da’an village as well, with Mingliang village down south and Sanhe village to the east. Meanwhile when looking at cold spots, the most notable area would be the stretch up north from Chengxi village to Chang’an village.

5.3.3 Visualising local HCSA with p-values

To get an effective comparison, let’s plot it against the p-values.

Show code
map4 <-
  tm_shape(HCSA) +
  tm_fill("p_sim",
          breaks = c(0, 0.001, 0.01, 0.05, 1),
          labels = c("0.01", "0.01", "0.05", "Not sig.")) +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title = "p-value of Gi*",
            main.title.size = 0.8) +
  tm_view(set.zoom.limits = c(6,8))

tmap_arrange(map3, map4, ncol = 2)

p-value map though seems to contain a mixed response, with places identified previously like Chengxi village being classified as not significant.

5.3.4 Visualising hot spot and cold spot areas

Now, let us plot the significant hot and cold spot areas of dengue cases.

Show code
HCSA_sig <- HCSA %>% filter(p_sim <0.05)

tmap_mode("plot")
tm_shape(HCSA) +
  tm_polygons() + 
  tm_borders(alpha = 0.5) +
  tm_shape(HCSA_sig) +
  tm_fill("gi_star") + 
  tm_borders(alpha = 0.4)

By combining the statistics together, we can see that the significant hot spots can still be identified, but the northwest portion of the most notable cold spot is not significant.

6. Emerging Hotspot Analysis

Emerging Hot Spot Analysis (EHSA) is a spatio-temporal analysis method for revealing and describing how hot spot and cold spot areas evolve over time. In our case, we will employ EHSA to investigate the spatio-temporal dynamics of dengue cases in the Tainan region during the period from EpiWeek 31 to 50 in the year 2023, shedding light on the evolving patterns of hot spots and cold spots over this specific timeframe.

6.1 Computing Gi*

We will first identify neighbors and derive their inverse distance weights. Recall that we created a spacetime cube object earlier, dengue_spacetime.

Show code
dengue_nb <- dengue_spacetime %>%
  activate("geometry") %>%
  mutate(nb = include_self(st_contiguity(geometry)),
         wt = st_inverse_distance(nb, geometry,
                                  scale = 1,
                                  alpha = 1),
         .before = 1) %>%
  set_nbs("nb") %>%
  set_wts("wt")

Using glimpse() to check out the data reveals the neighbors and weights for each time-slice for each village.

Show code
glimpse(dengue_nb)
Rows: 5,160
Columns: 5
$ village     <chr> "Qingcao", "Bao'an", "Chihkan", "Dacheng", "Chengbei", "Ch…
$ epiweek     <fct> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31…
$ total_cases <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 34, …
$ nb          <list> <1, 6, 118, 160>, <2, 126, 128, 138, 168, 222>, <3, 68, 6…
$ wt          <list> <0.0000000000, 0.0003765271, 0.0005151173, 0.0003023114>,…

6.2 Computing Gi* for each location

With that, we can use these new columns to manually calculate the local Gi* for each location. by grouping by Epiweek and using local_gstar_perm() of sfdep package. After which, we use unnest() to unnest gi_star column of the newly created gi_starts dataframe.

Show code
gi_stars <- dengue_nb %>%
  group_by(epiweek) %>%
  mutate(gi_star = local_gstar_perm(
    total_cases, nb, wt)) %>%
  tidyr::unnest(gi_star)

6.3 Mann-Kendall Test

With our Gi* statistics in hand, let us take a deeper look at each location and evaluate for a trend using Mann-Kendall test. Let us first try it on Xiqi village, one of the hot spots identified in our local spatial autocorrelation analysis above.

Show code
xiqi_mkt <- gi_stars %>%
  ungroup %>%
  filter(village == "Xiqi") %>%
  select(village, epiweek, gi_star)
Show code
ggplot(data = xiqi_mkt,
       aes(x=epiweek, y=gi_star)) +
  geom_line(group = 1) +
  theme_light()

The Mann-Kendall test shows that Xiqi village is an emerging hot spot from week 31 peaking at week 34, before slowly cooling down and hovering around the average range from week 39 onwards.

To replicate this for each village, we will use group_by() function.

Show code
ehsa <- gi_stars %>%
  group_by(village) %>%
  summarise(mk = list(
    unclass(
      Kendall::MannKendall(gi_star)))) %>%
  tidyr::unnest_wider(mk)

Next, we will arrange it to show significant emerging hot and cold spots.

Show code
emerging <- ehsa %>%
  arrange(sl, abs(tau)) %>%
  slice(1:5)

6.3 Performing Emerging Hotspot Analysis

Lastly, we will perform EHSA analysis by using emerging_hotspot_analysis() of sfdep package, which takes the spacetime cube object dengue_spacetime and the name of the variable of interest, in this case total_cases.

Show code
ehsa <- emerging_hotspot_analysis(
  x = dengue_spacetime,
  .var = "total_cases",
  k = 1,
  nsim = 99
)
Show code
ehsa
# A tibble: 258 × 4
   location     tau   p_value classification     
   <chr>      <dbl>     <dbl> <chr>              
 1 Qingcao   0.611  0.000191  oscilating hotspot 
 2 Bao'an   -0.453  0.00582   oscilating coldspot
 3 Chihkan   0.621  0.000147  oscilating hotspot 
 4 Dacheng   0.358  0.0297    sporadic hotspot   
 5 Chengbei  0.611  0.000191  oscilating hotspot 
 6 Chengnan  0.284  0.0855    sporadic coldspot  
 7 Fahua     0.337  0.0410    oscilating hotspot 
 8 Hainan    0.0421 0.820     oscilating hotspot 
 9 Guo'an    0.716  0.0000119 oscilating hotspot 
10 Xixin     0.379  0.0212    oscilating hotspot 
# ℹ 248 more rows

6.4 Visualising the distribution of EHSA classes

We will then use ggplot2 to visualise the distribution of EHSA classes.

Show code
ggplot(data = ehsa,
       aes(x = classification)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Distribution of EHSA classes",
       x = "EHSA classes",
       y = "Count")

Here are the definitions of each notable EHSA classes, courtesy of Prof Kam’s slides:

  • Oscillating Hot spots - A statistically significant hot spot for the final time-step interval that has a history of also being a statistically significant cold spot during a prior time step. Less than 90 percent of the time-step intervals have been statistically significant hot spots.

  • Sporadic Hot spots - A statistically significant hot spot for the final time-step interval with a history of also being an on-again and off-again hot spot. Less than 90 percent of the time-step intervals have been statistically significant hot spots and none of the time-step intervals have been statistically significant cold spots.

The cold spots version would just be the vice versa of the description above.

By visualising the distribution, we can see that majority of the distribution belongs to the oscillating hot/cold spots, with some being sporadic hot/cold spots. This perhaps signify that the local authorities are doing a relatively good job at controlling dengue outbreaks, as dengue hot spots tend to not stay for long.

6.5 Visualising EHSA

Now that we have an idea of the distribution of EHSA classes, let’s visualise the geographic distribution instead. But before we can do that, let us first combine tainan_clean with their EHSA classes.

Show code
tainan_ehsa <- tainan_clean %>%
  left_join(ehsa,
            by = join_by(village == location)) %>%
  select(3:7)
Show code
ehsa_sig <- tainan_ehsa %>%
  filter(p_value < 0.05)

tmap_mode("plot")
tm_shape(tainan_ehsa) +
  tm_polygons() +
  tm_borders(alpha = 0.5) +
tm_shape(ehsa_sig) +
  tm_fill("classification") +
  tm_borders(alpha = 0.4)

7. Conclusion

Show code
tmap_mode("view")
tm_shape(tainan_clean_cases) +
  tm_polygons("village") +
  tmap_options(max.categories = 257)

In our Global Spatial Autocorrelation Analysis, we’ve discovered through Global Moran’s I test and Geary’s test, backed by Monte Carlo simulation of each, that there is a strong positive spatial autocorrelation in terms of dengue cases count, and the results are statistically significant. This shows that dengue outbreaks are no coincidence, and a village with high number of dengue cases means that the neighboring villages have a high probability of having a dengue outbreak too. Considering the nature of how dengue fever is transmitted, this would be a fair conclusion.

Taking a closer look at the village level, our Local Spatial Autocorrelation Analysis reveals both areas with far above average number of dengue cases and those that are far below, in short, outliers. By comparing local Moran’s I values and their respective p-values, we have discovered that villages like there is a cluster of villages with abnormally higher number of cases, mainly Xiqi and Anfu village.

When we look at LISA cluster map and hot spot/cold spot area analysis, the results further identify these 2 region as high-highs and hot spot area respectively, reinforcing the fact that they are indeed has been a dengue red zone during the specified time period, in Singapore’s terms. Another 2 hot spots revealed by these analysis is Mingliang village down south and Sanhe village to the east.

However, differing from only looking at local Moran’s I values, these 2 named analysis also reveals low-lows or cold spots, which are areas that are clustered together, that have a significantly low number of dengue cases. The most noticeable at a glance would be the cluster of cold spots up north, stretching from Chengxi village to Chang’an village. Despite that, the area closer to Chengxi village seems to not be statistically significant leaving only the region directly centre-north as the main cluster of cold spots.

Lastly, our findings from Emerging Hotspot Analysis (EHSA) reveals that the majority of the villages including areas identified like Xiqi, Anfu, Mingliang, are oscillating hot spots, signifying that the outbreak is only towards the later part of the time frame. The only exception is Sanhe being an oscillating cold spot, which could indicate that the dengue broke out here first as compared to the villages mentioned above.

Overall, our findings from this assignment suggests that 1) dengue tends to break out in clusters and the neighboring areas of a village with dengue outbreak has a high risk of being the next, due to the nature of how dengue is transmitted and 2) outbreaks are generally well-contained by the local authorities as the number of villages that are consecutive hotspots are close to none.

8. Reflection

The materials provided in class was very thorough and well-explained, as I am able to complete around 70% of this assignment following hands-on exercises, in-class exercises, and lesson slides alone. The other 30% would come from searching up documentations of functions that we’ve never tried before in class like doing geary_test using sfdep package, or understanding the general idea and the concept behind a spacetime cube object. Overall, this assignment allows me to better appreciate the beauty of what stories could be told from data alone, especially when using right and efficient tools like geospatial visualisations and statistical analysis, with the formal being the main tool of this module. I am immensely satisfied with what I’ve learned through these assignments, and I am looking forward to what I can pick up next in these hands-on approaches.